A Technique for High Bandwidth and Deterministic Low Latency Load/Store Accesses to Multiple Cache Banks

Authors

  • Henk Neefs
  • Hans Vandierendonck
  • Koen De Bosschere
Abstract

One of the problems in future processors will be resource conflicts caused by several load/store units competing for access to the same cache bank. The traditional approach to handling this case is to introduce buffers combined with a crossbar. This approach suffers from (i) the nondeterministic latency of a load/store and (ii) the extra latency caused by the crossbar and the buffer management. A deterministic latency is of the utmost importance for the forwarding mechanism of out-of-order processors because it enables back-to-back operation of dependent instructions. We propose a technique that eliminates the buffers and crossbars from the critical path of load/store execution, resulting in both a low and a deterministic latency. Our solution consists of predicting which bank is to be accessed; a penalty is incurred only on a misprediction.
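To make the mechanism concrete, the following sketch (illustrative only; the table size, bank count, and replay cost are assumptions, not figures from the paper) shows a last-bank predictor indexed by the load/store PC: the access is steered to the predicted bank with no buffer or crossbar on the path, and a misprediction simply retrains the predictor and pays a fixed replay penalty.

    #include <stdint.h>

    #define NUM_BANKS     4    /* power of two (assumption) */
    #define LINE_BITS     6    /* 64-byte cache lines (assumption) */
    #define PRED_ENTRIES  1024 /* PC-indexed predictor table (assumption) */
    #define REPLAY_CYCLES 2    /* assumed fixed mispredict replay cost */

    static uint8_t pred_table[PRED_ENTRIES]; /* last bank seen per PC */

    static unsigned predict_bank(uint64_t pc) {
        return pred_table[(pc >> 2) % PRED_ENTRIES];
    }

    static unsigned actual_bank(uint64_t addr) {
        /* true bank: low-order line-address bits select the bank */
        return (addr >> LINE_BITS) & (NUM_BANKS - 1);
    }

    /* Issue a load to the predicted bank with no buffer or crossbar in
     * the way; a misprediction retrains the table and pays a fixed
     * replay cost instead of a variable queueing delay. */
    static unsigned issue_load(uint64_t pc, uint64_t addr, unsigned *cycles) {
        unsigned guess = predict_bank(pc);
        unsigned real  = actual_bank(addr);
        if (guess != real) {
            pred_table[(pc >> 2) % PRED_ENTRIES] = (uint8_t)real;
            *cycles += REPLAY_CYCLES;
        }
        return real;
    }

The common case, a correct prediction, thus has a latency that is fixed and known at schedule time, which is exactly what back-to-back forwarding of dependent instructions requires.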


Similar articles

Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load/Store Optimization

A high-bandwidth, low-latency load-store unit is a critical component of a dynamically scheduled processor. Unfortunately, it is also one of the most complex and non-scalable components. Recently, several researchers have proposed techniques that simplify the core load-store unit and improve its scalability in exchange for the in-order pre-retirement re-execution of some subset of the loads in ...
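The filtering idea can be sketched roughly as follows (a simplified rendering of the SVW concept; the table organization and names are illustrative): stores are numbered sequentially, a small address-hashed table records the youngest store to each location, and a load re-executes before retirement only if a store in its vulnerability window may have written its address.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define SSN_SLOTS 4096 /* address-hashed table size (assumption) */

    static uint64_t last_store_ssn[SSN_SLOTS]; /* youngest store SSN per slot */
    static uint64_t ssn;                       /* global store sequence number */

    static size_t slot(uint64_t addr) { return (addr >> 3) % SSN_SLOTS; }

    /* Each committed store advances the SSN and stamps its address slot. */
    static void on_store_commit(uint64_t addr) {
        last_store_ssn[slot(addr)] = ++ssn;
    }

    /* A load re-executes only if some store younger than the SSN it
     * observed at issue may have written its address. Hash aliasing is
     * conservative: it can only force extra re-executions, never skip
     * a necessary one. */
    static bool must_reexecute(uint64_t addr, uint64_t ssn_at_issue) {
        return last_store_ssn[slot(addr)] > ssn_at_issue;
    }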


Improving Memory Access Performance Using a Code Coalescing Unit

High clock frequencies combined with the deep pipelining employed by many state-of-the-art processors have forced cache hit accesses to become multi-cycle operations. For many programs, untolerated load latencies account for a significant portion of total execution time. In this paper, we present a mechanism called the Code Coalescing Unit (CCU) that can identify and eliminate at run-time several...
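The abstract is truncated before naming exactly which accesses the CCU removes, but the general flavor of coalescing can be sketched as follows (purely illustrative, not the paper's mechanism): a load that falls in the cache line already fetched by the previous load needs no second cache access.

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_BITS 6 /* 64-byte lines (assumption) */

    static uint64_t prev_line;       /* line fetched by the previous load */
    static bool     prev_line_valid;

    /* Returns true when this load falls in the line the previous load
     * already fetched, so no second cache access is needed. */
    static bool coalesce_with_prev(uint64_t addr) {
        uint64_t line = addr >> LINE_BITS;
        if (prev_line_valid && line == prev_line)
            return true;
        prev_line = line;      /* issue normally and remember the line */
        prev_line_valid = true;
        return false;
    }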


Memory Bank Predictors

Cache memories are commonly implemented as multiple memory banks to improve bandwidth and latency. Early knowledge of the data cache bank that an instruction will access can help improve performance in several ways. One scenario that is likely to become increasingly important is clustered microprocessors with a distributed cache. This work presents a study of different cache ban...
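One concrete form such a predictor could take (an illustrative sketch, not necessarily one of the designs studied in the paper) is a PC-indexed table that predicts the next bank from the last observed bank plus the observed bank stride, which captures strided array accesses.

    #include <stdint.h>
    #include <stddef.h>

    #define NUM_BANKS 4    /* power of two (assumption) */
    #define LINE_BITS 6    /* 64-byte lines (assumption) */
    #define ENTRIES   1024 /* PC-indexed table (assumption) */

    /* Per-PC state: last observed bank plus the observed bank stride. */
    static struct { uint8_t last; uint8_t stride; } bp[ENTRIES];

    static unsigned bank_of(uint64_t addr) {
        return (addr >> LINE_BITS) & (NUM_BANKS - 1);
    }

    static unsigned predict(uint64_t pc) {
        size_t i = (pc >> 2) % ENTRIES;
        return (bp[i].last + bp[i].stride) & (NUM_BANKS - 1);
    }

    static void train(uint64_t pc, uint64_t addr) {
        size_t i = (pc >> 2) % ENTRIES;
        unsigned b = bank_of(addr);
        bp[i].stride = (uint8_t)((b - bp[i].last) & (NUM_BANKS - 1));
        bp[i].last   = (uint8_t)b;
    }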


Mega-KV: A Case for GPUs to Maximize the Throughput of In-Memory Key-Value Stores

In-memory key-value stores play a critical role in data processing to provide high throughput and low latency data accesses. In-memory key-value stores have several unique properties that include (1) data intensive operations demanding high memory bandwidth for fast data accesses, (2) high data parallelism and simple computing operations demanding many slim parallel computing units, and (3) a l...
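These properties translate into batches of independent index lookups; a minimal CPU-side sketch of that access pattern (illustrative only, not Mega-KV's GPU code; the hash function and table layout are assumptions) is:

    #include <stdint.h>
    #include <stddef.h>

    #define SLOTS (1u << 20) /* index size (assumption); key 0 = empty */

    static struct { uint64_t key; uint64_t loc; } idx[SLOTS];

    static size_t hash64(uint64_t key) {
        return (size_t)((key * 0x9E3779B97F4A7C15ull) >> 44); /* top 20 bits */
    }

    /* Process a whole batch of GETs. Every lookup is independent, so the
     * loop body maps directly onto many slim parallel threads. */
    static void get_batch(const uint64_t *keys, uint64_t *locs, size_t n) {
        for (size_t i = 0; i < n; i++) {
            size_t s = hash64(keys[i]) & (SLOTS - 1);
            while (idx[s].key != 0 && idx[s].key != keys[i])
                s = (s + 1) & (SLOTS - 1); /* linear probing */
            locs[i] = (idx[s].key == keys[i]) ? idx[s].loc : 0;
        }
    }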


NUCA Designs Solve the On-Chip Wire Delay Problem for Future Large Integrated Caches. By Embedding a Network in the Cache, NUCA Designs Let Data Migrate

The next generation of today's high-performance processors incorporates large level-two caches on the processor die. For example, the IBM Power5 will contain a 1.92-Mbyte L2 cache, the Hewlett-Packard PA8700 will contain 2.25 Mbytes of unified on-chip cache, and the Intel Itanium 2 will contain 6 Mbytes of on-chip L3 cache. Cach...
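The migration idea can be sketched as gradual promotion (a simplification; the geometry and policies below are illustrative): banks are ordered by wire distance, lookups search nearest-first, and a hit in a far bank swaps the line one bank closer, so frequently used data gravitates toward the fast banks.

    #include <stdint.h>
    #include <string.h>

    #define NUM_BANKS 8   /* ordered by wire distance: bank 0 is fastest */
    #define SETS      256 /* sets per bank (assumption) */

    static uint64_t tag[NUM_BANKS][SETS]; /* one tag per set per bank */

    static void nuca_init(void) {
        memset(tag, 0xFF, sizeof tag); /* all-ones marks empty slots */
    }

    /* Search banks nearest-first. On a hit in a far bank, swap the line
     * one bank closer ("gradual promotion") so hot data migrates toward
     * the controller; hit latency grows with the bank index. */
    static int nuca_lookup(uint64_t line_addr) {
        unsigned set = (unsigned)(line_addr % SETS);
        for (int b = 0; b < NUM_BANKS; b++) {
            if (tag[b][set] == line_addr) {
                if (b > 0) {
                    uint64_t t = tag[b - 1][set];
                    tag[b - 1][set] = tag[b][set];
                    tag[b][set] = t;
                }
                return b; /* distance actually paid on this access */
            }
        }
        return -1; /* miss: fills would go to the farthest bank */
    }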



Journal:

Volume   Issue

Pages   -

Publication date: 2000